Deciphering complex breakage-fusion-bridge genome rearrangements with Ambigram

Breakage-fusion-bridge (BFB) is a complex rearrangement that leads to tumor malignancy. Existing models for detecting BFBs rely on the ideal BFB hypothesis, ruling out the possibility of BFBs entangled with other structural variations, that is, complex BFBs. We propose an algorithm, Ambigram, to identify complex BFBs and reconstruct the rearranged structure of the local genome during the cancer subclone evolution process. Ambigram handles data from short-read, linked-read, long-read, and single-cell sequencing, as well as optical mapping technologies. Benchmarked against state-of-the-art methods, Ambigram successfully deciphers gold- or silver-standard complex BFBs in multiple cancers. Ambigram dissects the intratumor heterogeneity of complex BFB events with single-cell reads from melanoma and gastric cancer. Furthermore, applying Ambigram to liver and cervical cancer data suggests that the BFB mechanism may mediate oncovirus integrations. BFB also exists in non-cancer genomes. Investigating the complete human genome reference with Ambigram suggests that the BFB mechanism may be involved in two genome reorganizations of Homo sapiens during evolution. Moreover, Ambigram discovers the signals of recurrent foldback inversions and complex BFBs in whole genome data from the 1000 Genomes Project and congenital heart diseases, respectively.


Finding BFB candidate SV sets
This is a clustering method to identify sets of SVs that may be involved in the same complex SV (CSV) event, such as BFB. Given a large data set consisting of miscellaneous SVs, we define each SV as a pair of chromosome names and breakpoints, denoted (chr_1, bkp_1, chr_2, bkp_2), where bkp_1, bkp_2 ∈ N and chr_1 and chr_2 are the chromosomes that the two breakpoints belong to, respectively. The distance between two SVs is defined as the least absolute value among the four pairwise breakpoint differences, where each SV contributes one breakpoint to every pair. Note that the difference is taken as infinity if the two breakpoints belong to different chromosomes. We use breadth-first search to cluster SVs into groups, such that the distance between linked SVs stays within a predefined distance α. Besides, the range of an SV group is defined as the difference between the largest breakpoint and the smallest breakpoint among all SVs in the group. We also set a range limit β so that the range of each resultant SV set stays within the limit.

Algorithm 1 Cluster SVs into SV sets.
1: Sort all SVs in ascending order of breakpoints;
2: Let Q be a queue and put an ungrouped SV into Q;
3: Let S be an empty SV set;
4: while Q is not empty do
5:   Fetch the first SV u from Q, and add u into S;
6:   for each ungrouped SV v do
7:     if distance of u and v ≤ α then
8:       if range of S including v ≤ β then
9:         Add v into S and put v into Q;
10:      end if
11:    end if
12:  end for
13: end while
14: Get an SV set S;
15: if some SVs are ungrouped then
16:   Go to line 2;
17: end if
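The clustering procedure can be illustrated with a minimal Python sketch. It is not the Ambigram implementation: the function names (`sv_distance`, `group_range`, `cluster_svs`) and the tuple layout are assumptions made for illustration, accepted SVs are enqueued so that the breadth-first search continues from them, and the group range is computed literally over all breakpoints regardless of chromosome.

```python
from collections import deque
from math import inf


def sv_distance(u, v):
    """Least absolute difference among the four breakpoint pairs of two SVs.

    Each SV is a tuple (chr_1, bkp_1, chr_2, bkp_2). A pair of breakpoints on
    different chromosomes contributes an infinite difference.
    """
    ends_u = [(u[0], u[1]), (u[2], u[3])]
    ends_v = [(v[0], v[1]), (v[2], v[3])]
    best = inf
    for chr_a, bkp_a in ends_u:
        for chr_b, bkp_b in ends_v:
            if chr_a == chr_b:
                best = min(best, abs(bkp_a - bkp_b))
    return best


def group_range(svs):
    """Largest breakpoint minus smallest breakpoint among all SVs in a group."""
    bkps = [b for sv in svs for b in (sv[1], sv[3])]
    return max(bkps) - min(bkps)


def cluster_svs(svs, alpha, beta):
    """Group SVs by breadth-first search under distance limit alpha and range limit beta."""
    ungrouped = sorted(svs)  # ascending order of (chromosome, breakpoint)
    groups = []
    while ungrouped:
        seed = ungrouped.pop(0)
        group = [seed]
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            for v in list(ungrouped):
                if sv_distance(u, v) <= alpha and group_range(group + [v]) <= beta:
                    ungrouped.remove(v)
                    group.append(v)
                    queue.append(v)  # assumption: continue the search from v
        groups.append(group)
    return groups


example_svs = [
    ("chr1", 100, "chr1", 200),
    ("chr1", 250, "chr1", 400),
    ("chr1", 10000, "chr1", 10100),
    ("chr2", 120, "chr2", 300),
]
example_groups = cluster_svs(example_svs, alpha=100, beta=1000)
```

With these toy inputs, the first two SVs on chr1 fall into one group, while the distant chr1 SV and the chr2 SV each form their own group. Tightening β below the combined range of the first two SVs splits them apart.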

Formulating BFB as a DAG
Based on the definitions of mono-chains, loops, and BFB paths in Methods, we can conclude the following properties:
Lemma 1.1 Removing a loop from a BFB path will produce another BFB path.
We denote a BFB path with a loop as P = e(a_1, b_1)|e(a_2, b_2)|...|e(a_n, b_n), where e(a_k, b_k) is a loop. Since a loop is a symmetric entity, the entities flanking a loop are either reverse complements or a pair of parent and child entities. Hence, the entities e(a_{k−1}, b_{k−1}) and e(a_{k+1}, b_{k+1}) share a pair of reverse complementary segments linked by an FBI junction. After removing the loop, the path becomes P' = e(a_1, b_1)|...|e(a_{k−1}, b_{k−1})|e(a_{k+1}, b_{k+1})|...|e(a_n, b_n), which still keeps the continuity and the palindromic suffix. Therefore, the resultant path P' is still a BFB path.

Integer linear programming
Apart from the ILP objective function and the two ILP constraints that define CN differences, we incorporate additional domain knowledge to refine the ILP results and meet some special requirements. Several constraints guarantee the connectivity of the output entities so that all of them can be integrated into a BFB DAG for constructing BFB paths.
Firstly, we denote the set of segments involved in a BFB event as S = {s_1, s_2, ..., s_n}. For any mono-chain m(i, j), we define the set of its child mono-chains as P^{i,j}_m = {m(a, b) | a > i, b = j or a = i, b < j}. According to Lemma 1.2, any non-loop BFB path consists of length-decreasing mono-chains with the parent-child relationship. Therefore, each mono-chain has at most one child mono-chain. Hence, an ILP constraint guarantees that no mono-chain has more than one child mono-chain.
Besides, for any mono-chain m(i, j), we define the set of its parent mono-chains as P^{i,j}_m = {m(a, b) | a < i, b = j or a = i, b > j}. According to Lemma 1.2, an inequality guarantees that every mono-chain (except the reference path) has at least one parent mono-chain. As a result, the child mono-chain has at least one predecessor to follow and becomes a valid vertex in a BFB DAG.
Similarly, for any loop l(i, j), we denote the set of parent entities as P^{i,j}_e = {e(a, b) | a < i, b = j or a = i, b > j}, since a loop can follow either a parent mono-chain or a parent loop in a BFB path. Hence, another constraint guarantees that every loop has at least one parent entity. Therefore, the child loop can be a successor of a parent mono-chain or be inserted into a parent loop, composing a BFB DAG.
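To make the child and parent sets concrete, the following Python sketch enumerates them for a mono-chain m(i, j). It assumes 1 ≤ i ≤ j ≤ n and represents each mono-chain by its index pair. The function names are hypothetical, and the sketch only lists candidates; it does not build the ILP constraints themselves.

```python
def child_mono_chains(i, j):
    """Index pairs (a, b) with (a > i, b = j) or (a = i, b < j), keeping a <= b."""
    shrink_left = [(a, j) for a in range(i + 1, j + 1)]
    shrink_right = [(i, b) for b in range(i, j)]
    return shrink_left + shrink_right


def parent_mono_chains(i, j, n):
    """Index pairs (a, b) with (a < i, b = j) or (a = i, b > j), within 1..n."""
    extend_left = [(a, j) for a in range(1, i)]
    extend_right = [(i, b) for b in range(j + 1, n + 1)]
    return extend_left + extend_right


children_of_2_4 = child_mono_chains(2, 4)
parents_of_2_4 = parent_mono_chains(2, 4, n=5)
```

Under these assumptions the two relations mirror each other: m(2, 4) is a child of m(1, 4) exactly because m(1, 4) is a parent of m(2, 4).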

Single-cell mode
We also provide a method to reconstruct BFB paths from single-cell data. Users can input CN profiles and SV information of multiple subclones evolving over time. Ambigram will add ILP constraints on common mono-chains and loops, which are shared by all the subclones. Furthermore, our algorithm will add error terms to the objective function (Formula 1 in Methods) that account for the CN differences of mono-chains and loops shared by subclones, denoted by δ_m and δ_l, respectively. As a result, we can reconstruct several BFB paths with some similar parts. Here are the objective function (Formula 4) and ILP constraints (Formulas 5 and 6) for the single-cell mode:
−ε_l ≤ c_l(a, b) − c'_l(a, b) ≤ ε_l, ∀l(a, b) ∈ L (6)
where c_l(a, b) and c'_l(a, b) denote the CNs of a shared loop l(a, b) in two subclones.

Extra information from linked reads, long reads, and optical mapping alignment
Since linked reads, long reads, and optical mapping alignment data provide extra information, our algorithm can use the linkage information from them, which indicates possible connections among segments, to construct more convincing BFB paths in terms of the observed CN and SV information. We represent the extracted linkage information by a set of entities E = {e(i, j) | i, j ∈ {1, 2, ..., n}}, where s_i s_{i+1} ... s_j is a linked genome sequence indicated by the linkage information. Then we constrain the CN of the entities in E in ILP Formula 7 to improve the probability that these entities appear in the output as parts of BFB paths. As a result, the linkage information can help our algorithm construct BFB paths that better fit the real scenario.

Compose a BFB path by connecting entities in the topological order
We use recursion and backtracking to find all topological orders from a BFB DAG (Supplementary Methods, Algorithm 2). Then we follow the first topological order in the result to construct a BFB path (Supplementary Fig. 45). A temporary BFB path P is extended by one entity in each iteration. Suppose a topological order contains m entities, E = [e(a_1, b_1), e(a_2, b_2), ..., e(a_m, b_m)]. We fetch the first entity from E and initialize the temporary BFB path as P = e(a_1, b_1). Then we iteratively fetch an entity from E and add it into an appropriate position following its parent in P. For each entity, we start searching for the position from the tail of the path P. During the process, neighboring entities are a pair of parent and child, and P always keeps a palindromic suffix. Eventually, once all entities in E have been added into P, the final BFB path consisting of E is derived.
Algorithm 2 Find all topological orders.
1: Initialize all entities in the BFB DAG as unvisited;
2: Initialize an empty ordered list of entities E;
3: Add an entity e with indegree = 0 into E, and set it as visited;
4: Decrease indegree of child entities of e by 1;
5: Recursively run line 3 until all entities have 0 indegrees, and mark E as a topological order;
6: Remove e from E, reset it as unvisited,
7: Increase indegree of child entities of e by 1, and go to line 3;
8: Return all topological orders;

Algorithm 3 Compose a BFB path in the topological order.
1: Sort all entities in E by a topological order;
2: Fetch the first entity in E and initialize P as the entity;
3: for each entity e ∈ E do
4:   Find an appropriate position for inserting e in P;
5:   Add e into the position in P;
6:   Remove e from E;
7: end for
8: Return P;
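The backtracking enumeration of topological orders can be sketched in Python as follows. This is a generic illustration rather than Ambigram's code: the DAG is assumed to be given as a dictionary `children` that maps every entity (including leaves) to the list of its child entities, and the entity labels are placeholders.

```python
def all_topological_orders(children):
    """Enumerate every topological order of a DAG by recursion and backtracking.

    `children` maps each entity to the list of its child entities; every entity
    must appear as a key, with an empty list if it has no children.
    """
    indegree = {entity: 0 for entity in children}
    for kids in children.values():
        for kid in kids:
            indegree[kid] += 1

    visited = set()
    order = []
    results = []

    def backtrack():
        if len(order) == len(children):
            results.append(list(order))
            return
        for entity in children:
            if entity not in visited and indegree[entity] == 0:
                # Choose the entity and release its children.
                visited.add(entity)
                order.append(entity)
                for kid in children[entity]:
                    indegree[kid] -= 1
                backtrack()
                # Undo the choice so that other orders can be explored.
                for kid in children[entity]:
                    indegree[kid] += 1
                order.pop()
                visited.remove(entity)

    backtrack()
    return results


toy_dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
toy_orders = all_topological_orders(toy_dag)
```

For the toy DAG, A must come first and D last, while B and C may appear in either order, so two topological orders are returned.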

Algorithm of the BFB-TRX mode
After constructing BFB paths in local genome regions, we design an algorithm to concatenate two or more BFB paths with translocations. Given the segment set S = {s_1, s_2, ..., s_n}, we define a translocation by a pair of segments (s_i, s_j), where i, j ∈ {1, 2, ..., n}. Note that each element of the pair can be either a segment s_i or its reverse complement s̄_i. Then we group m translocations in a set T = {(s_i, s_j)_k | k ∈ {1, 2, ..., m}, i, j ∈ {1, 2, ..., n}}. Moreover, we sort all translocations in T so that s_j ∈ (s_i, s_j)_a and s_i ∈ (s_i, s_j)_{a+1} belong to the same BFB path, for all a ∈ {1, 2, ..., m − 1}. As a result, we iteratively concatenate BFB paths by the translocations in T.

Algorithm 4 Concatenate BFB paths with translocation.
1: Fetch the first translocation (s_i, s_j)_1 in T;
2: Find BFB paths P_1 and P_2 that are connected by (s_i, s_j)_1;
3: Cut P_1 at the last position of s_i and P_2 at the first position of s_j;
4: Link truncated P_1 and P_2 as P;
5: for each remaining translocation (s_i, s_j)_k ∈ T do
6:   Set P_1 = P and find P_2 that s_j belongs to;
7:   Cut P_1 at the last position of s_i and P_2 at the first position of s_j;
8:   Link truncated P_1 and P_2 as P;
9: end for
10: Return P;
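The concatenation step can be illustrated with a short Python sketch. It assumes that each BFB path is a list of segment labels, that a leading "-" marks a reverse complementary segment, and that the translocations are already sorted as described above. The helper names are hypothetical.

```python
def find_path(paths, segment):
    """Return the first BFB path that contains the given segment label."""
    for path in paths:
        if segment in path:
            return path
    raise ValueError(f"no path contains segment {segment!r}")


def link_paths(p1, p2, s_i, s_j):
    """Keep p1 up to its last s_i and p2 from its first s_j, then join them."""
    last_i = len(p1) - 1 - p1[::-1].index(s_i)
    first_j = p2.index(s_j)
    return p1[: last_i + 1] + p2[first_j:]


def concatenate_bfb_paths(paths, translocations):
    """Concatenate BFB paths along a sorted list of translocations (s_i, s_j)."""
    s_i, s_j = translocations[0]
    path = link_paths(find_path(paths, s_i), find_path(paths, s_j), s_i, s_j)
    for s_i, s_j in translocations[1:]:
        # The growing path plays the role of P_1 in every later step.
        path = link_paths(path, find_path(paths, s_j), s_i, s_j)
    return path


toy_paths = [
    ["A1", "A2", "-A2", "-A1"],
    ["B1", "B2", "-B2"],
    ["C1", "C2"],
]
toy_translocations = [("-A2", "B2"), ("-B2", "C2")]
toy_result = concatenate_bfb_paths(toy_paths, toy_translocations)
```

In the toy example, the first path is cut after "-A2", the second path contributes "B2" and "-B2", and the third path contributes only "C2".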

Algorithm of T2T alignment
We design an algorithm to efficiently align partial sequences on the T2T complete human genome. The algorithm is based on dynamic programming and supports both forward and reverse alignment. Given two genomic sequences G_1 and G_2 that consist of the four letters "A", "T", "C", and "G" representing the four bases, the algorithm finds the longest common subsequence (LCS) present in both of them. The letters of the LCS appear in the same relative order in G_1 and G_2 but are not necessarily contiguous. As for reverse alignment, two genomic sequences are reverse complements if their bases hold a complementary mapping relationship, in which "A" corresponds to "T" and "C" corresponds to "G". To check whether two sequences are reversely aligned, we find the longest palindrome subsequence (LPS) of both sequences. The LPS is a subsequence that appears as a reverse complement in reverse relative order in the two sequences but is not necessarily contiguous. To find the LPS of two input sequences G_1 and G_2, we first convert G_2 into its reverse complement and then use the same method as for the LCS to get the LPS of both sequences. For both LCS and LPS, we evaluate the similarity between the two input sequences as the length of the LCS or LPS divided by the average length of both input sequences. If the similarity is higher than 0.8, we consider the two sequences a valid match.
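A minimal Python sketch of this similarity measure is shown below. It uses the textbook dynamic-programming recurrence for the LCS length and is intended only as an illustration under that assumption; the function names and the toy sequences are not taken from Ambigram.

```python
COMPLEMENT = str.maketrans("ATCG", "TAGC")


def reverse_complement(seq):
    """Complement every base (A<->T, C<->G) and reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def lcs_length(g1, g2):
    """Length of the longest common subsequence, using two rolling DP rows."""
    prev = [0] * (len(g2) + 1)
    for x in g1:
        curr = [0]
        for j, y in enumerate(g2, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(g1, g2, reverse=False):
    """LCS (forward) or LPS (reverse) length over the average input length."""
    if reverse:
        g2 = reverse_complement(g2)
    return lcs_length(g1, g2) / ((len(g1) + len(g2)) / 2)


def is_valid_match(g1, g2, reverse=False, threshold=0.8):
    """Apply the similarity threshold (0.8 in the text) to two sequences."""
    return similarity(g1, g2, reverse=reverse) > threshold


forward_similarity = similarity("ATCGATCG", "ATCGTTCG")
reverse_similarity = similarity("AACG", "CGTT", reverse=True)
```

The two toy sequences in the forward case differ at a single base, so seven of eight letters are shared in order. In the reverse case, "CGTT" is the exact reverse complement of "AACG".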
Algorithm 5 Align two genome sequences.

Remarks of Ambigram
To elaborate on this part better, we define the number of segments as n and the total number of mono-chains and loops as m. According to the definition of mono-chains and loops, each of them is determined by two segments. Therefore, the numbers of mono-chain candidates and loop candidates are equal. Then the total number is m = 2 × n(n+1)/2 = n(n+1).
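The count can be checked by brute-force enumeration. The sketch below assumes that a candidate is identified by an index pair (i, j) with 1 ≤ i ≤ j ≤ n, which yields n(n+1)/2 pairs for each entity type; the tuple tags "m" and "l" are illustrative labels only.

```python
def entity_candidates(n):
    """Enumerate mono-chain and loop candidates over segments s_1..s_n."""
    spans = [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)]
    mono_chains = [("m", i, j) for i, j in spans]
    loops = [("l", i, j) for i, j in spans]
    return mono_chains, loops


mono_chains_4, loops_4 = entity_candidates(4)
total_4 = len(mono_chains_4) + len(loops_4)
```

For n = 4, each entity type has 10 candidates, giving m = 20 = 4 × 5.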

Searching space
The searching space is proportional to the total number of possible BFB paths. Based on the algorithms explained above, the permutations of loops that are inserted into a non-loop BFB path determine the searching space. In the worst case, all possible mono-chains are used to construct a non-loop BFB path, i.e., s_1 s_2 ... s_n | s_n s_{n−1} ... s_2 | s_2 s_3 ... s_{n−1} | ... | s_{n/2}. Given n(n+1)/2 loops, each loop has 2 positions for insertion, so we have O(n^2) possible BFB paths. Therefore, the search space is O(n^2).

Time complexity
There are several parts that cost the most running time: (1) Solve the copy numbers of entities (mono-chains and loops) with ILP that includes constraints between entities -O(n

Space complexity
Most space is used to record entities and all topological orders. According to the search space, the upper bound of the number of topological orders is O(n^2) = O(m). As a result, the space complexity is O(m).

Besides, the third track shows the resolved BFB paths, in which the circle and triangle points refer to the 5' end and 3' end, respectively. The innermost part represents all the SVs involved with the BFB event. (d) Ambigram resolved the BFB path. Overall, chr3 undergoes four BFB cycles. The first BFB cycle occurs when segment H6 is fused with its reverse complement on the chromatid duplication. Then the second BFB cycle occurs when the breakpoint on reverse segment H1 is fused with the left breakpoint on segment H1 on the sister chromatid. Moreover, the third BFB cycle occurs when the breakage and fusion happen at reverse segment H3 and segment H2 on its reverse counterpart. Finally, the last cycle occurs when the FBI connects segment H4 to reverse segment H5, which contributes to a stable state and indicates the end of this BFB event.
Besides, the third track shows the resolved BFB paths, in which the circle and triangle points refer to the 5' end and 3' end, respectively. The innermost part represents all the SVs involved with the BFB event. (d) Ambigram resolved the BFB path. We interpret this BFB with two stages. In the first stage, segment v2 on virus is inserted between segment H3 and segment H4 on chr8. In the second stage, the virus-integrated chromosome undergoes two BFB cycles. The first BFB cycle occurs when segment H4 is fused with its reverse complement on the chromatid duplication. Then the second BFB cycle occurs when reverse segment H2 is fused with its complement on the sister chromatid. Finally, the breakage on reverse segment H2 contributes to a stable state.
(a-f) Results derived by Ambigram for simulated BFB instances 1-6 with various sequencing protocols, depths, and purities. "-" means that the inputs of Ambigram are SVs called from one sequencing protocol (SV=PE, SV=10x, SV=PB, or SV=ONT) and ground truth CNs. "Resolved" means all SVs and CNs from the inferred BFB path are matched with those of ground truths, otherwise "Unresolved". "Resolved by inferring the virtual FBI" signifies Ambigram resolves the BFB path by recovering the undetected FBIs in low sequencing depth and tumor purity scenarios. "Resolved by utilizing read linkage" means that Ambigram cannot resolve the BFB path with CNs and detected SVs, while it can resolve the path after incorporating the linked or long read linkage from 10x, PB, or ONT data. The SV precision measures the portion of the SVs inferred correctly among predictions, that is, the number of ground truth SVs inferred correctly by the tool over the total number of inferred SVs. The SV F1-score is the harmonic mean of SV precision and SV recall. Note that "Resolved" implies CN accuracy = 1, SV precision = 1, SV recall = 1, and SV F1-score = 1.
Supplementary Figure 9: Benchmarking BFBFinder with in silico data. (a-f) The BFB paths from BFBFinder of simulated instances 1-6. The number n in a circle denotes that the SV comes from chromosome duplication in the n-th BFB cycle. (g-r) Results derived by BFBFinder for simulated BFB instances 1-6 with various sequencing depths and purities. "-" means that the inputs of BFBFinder are SVs called from one sequencing protocol and ground truth CNs. "Resolved" means all SVs and CNs from the inferred BFB path are matched with those of ground truths, otherwise "Unresolved". "Resolved by inferring the virtual FBI" signifies BFBFinder resolves the BFB path by recovering the undetected FBIs in low sequencing depth and tumor purity scenarios. "Resolved by utilizing read linkage" means that BFBFinder cannot resolve the BFB path with CNs and detected SVs, while it can resolve the path after incorporating the linked or long read linkage from 10x, PB, or ONT data.
The CN accuracy is measured by the number of segments with correctly inferred CNs divided by the total segment number. The SV precision measures the portion of the SVs inferred correctly among predictions, that is, the number of ground truth SVs inferred correctly by the tool over the total number of inferred SVs. The SV recall measures the portion of the SVs inferred correctly among ground truths, that is, the number of ground truth SVs inferred correctly by the tool over the total number of ground truth SVs. The SV F1-score is the harmonic mean of SV precision and SV recall. Note that an instance is resolved only if CN accuracy, SV precision, SV recall, and SV F1-score are 1. BFB: breakage-fusion-bridge. FBI-hh: fold-back inversion with head-to-head direction. FBI-tt: fold-back inversion with tail-to-tail direction.
The SV precision measures the portion of the SVs inferred correctly among predictions, that is, the number of ground truth SVs inferred correctly by the tool over the total number of inferred SVs. The SV recall measures the portion of the SVs inferred correctly among ground truths, that is, the number of ground truth SVs inferred correctly by the tool over the total number of ground truth SVs. The SV F1-score is the harmonic mean of SV precision and SV recall. Box plots indicate the median (middle line), 25th, 75th percentile (box), and 5th and 95th percentile (whiskers) as well as outliers (single points).
The whole complex BFB event is fully illustrated in Fig. 3 and this supplementary figure. This figure mainly illustrates the first stage, when chr3 and chr6 encounter six and four BFB cycles, respectively. The first BFB cycle on chr3 occurs when a sister chromatid is replicated, and segment H3 8 is fused with its reverse complement. Then the second BFB cycle starts when the double-strand breaks off at reverse segment H3 3, which leads to instability of chr3.
A duplication is reproduced, and reverse segment H3
Besides, the third track shows the resolved BFB paths, in which the circle and triangle points refer to the 5' end and 3' end, respectively. The innermost part represents all the SVs involved with the BFB event. (d) LINX resolved the BFB path. The first BFB cycle starts when the breakage happens at reverse complementary segment H3 2. Then chr3 replicates its sister chromatid spanning from segment H3 3 to segment H3 8, and two sister chromatids are fused. Following that, the double-strand breaks off at segment H3 7, a sister chromatid is reproduced, and segment H3 7 is fused with reverse segment H3 6. Moreover, another breakage occurs at reverse segment H3 6, connected to H6 2 on chr6 with an inter-chromosomal rearrangement. Then the third BFB cycle happens when duplication and fusion occur at segment H6 2 and the reverse complementary segment H6 3. Furthermore, the fourth BFB cycle occurs when the double-strand breaks off at segment H3 4, and a sister chromatid is replicated. Finally, fragments of chr10 and chr12 are inserted between segment H3 4 and reverse segment H3 3, and a stable state is achieved after the final breakage occurs at reverse segment H3 3. Since chromosomal duplication follows these 4 BFB cycles through a whole genome doubling event, the final BFB path gets two copies. (e) The total CN error, CN accuracy, and SV recall derived by Ambigram, LINX, and BFBFinder compared to ground truth. The total CN error is the sum of all segment copy number differences between the output and ground truth. BFBFinder is marked with "*" as BFBFinder merely accepts segment CN profiles from a single chromosome, so we fit the ground truth CN profiles of BFB paths of chr3 and chr6 separately. The CN accuracy is measured by the number of segments with correctly inferred CNs divided by the total segment number.
The SV recall measures the portion of the SVs inferred correctly among ground truths, that is, the number of ground truth SVs inferred correctly by the tool over the total number of ground truth SVs. Although BFBFinder had a small CN error, it failed to resolve the complex BFB event with translocation.
Figure 14: Results of COLO829 instances from BFBFinder. (a-b) The BFB paths from BFBFinder of COLO829 instances. (c-f) Results derived by BFBFinder for COLO829 cases with various sequencing depths and purities. "-" means that the inputs of BFBFinder are SVs called from one sequencing protocol and ground truth CNs. "Resolved" means all FBIs from the inferred BFB path are matched with those of ground truths, otherwise "Unresolved". "Resolved by inferring the virtual FBI" signifies BFBFinder resolves the BFB path by recovering the undetected FBIs in low sequencing depth and tumor purity scenarios. "Resolved by utilizing read linkage" means that BFBFinder cannot resolve the BFB path with CNs and detected SVs, while it can resolve the path after incorporating the linked or long read linkage from 10x, PB, or ONT data. The CN accuracy is measured by the number of segments with correctly inferred CNs divided by the total segment number. The SV precision measures the portion of the SVs inferred correctly among predictions, that is, the number of ground truth SVs inferred correctly by the tool over the total number of inferred SVs. The SV recall measures the portion of the SVs inferred correctly among ground truths, that is, the number of ground truth SVs inferred correctly by the tool over the total number of ground truth SVs. The SV F1-score is the harmonic mean of SV precision and SV recall. Note that an instance is resolved only if the FBI recall is 1. BFB: breakage-fusion-bridge. FBI-hh: fold-back inversion with head-to-head direction.
Besides, the third track shows the resolved BFB paths, in which the circle and triangle points refer to the 5' end and 3' end, respectively. The innermost part represents all the SVs involved with the BFB event. (d) Ambigram resolved the BFB path. We interpret this BFB with two stages. In the first stage, chr15 undergoes four BFB cycles. The first BFB cycle occurs when segment H15 7 is fused with its reverse complement on the chromatid duplication. Then the second BFB cycle occurs when reverse segment H15 1 is fused with its complement on the sister chromatid. Furthermore, the third BFB cycle occurs when the breakage happens at reverse segment H15 2. Following that, a sister chromatid is replicated, and segments H15 2 and H15 2 are fused. Finally, the fourth BFB cycle fuses segments H15 5-H15 6, and another breakage at reverse segment H15 6 contributes to a stable state that indicates the end of the first stage. In the second stage, an inter-chromosomal rearrangement occurs on chr15 and chr6, inserting reverse segment H6 2 into the region between segments H15 3 and H15 5. Besides, another translocation links chr15 and chr20 by connecting segment H20 2 to reverse segment H15 2 and segment H15 3 on the BFB path of chr15, which leads to the final complex BFB path. (e) The total CN error, CN accuracy, and SV recall derived by Ambigram and BFBFinder compared to ground truth. The total CN error is the sum of all segment copy number differences between the output and ground truth. BFBFinder is marked with "*" as BFBFinder merely accepts segment CN profiles from a single chromosome, so we fit the ground truth CN profiles of BFB paths of chr15. The CN accuracy is measured by the number of segments with correctly inferred CNs divided by the total segment number. The SV recall measures the portion of the SVs inferred correctly among ground truths, that is, the number of ground truth SVs inferred correctly by the tool over the total number of ground truth SVs.
Although BFBFinder had a small CN error, it failed to resolve the complex BFB event with translocation.
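The evaluation metrics defined in these captions can be written out as a short Python sketch. The data layout is an assumption made for illustration: CN profiles are dictionaries from segment name to CN, SVs are hashable identifiers compared by exact equality, and the total CN error is taken as the sum of absolute differences. None of the names or values below come from the benchmark itself.

```python
def cn_accuracy(inferred_cn, truth_cn):
    """Fraction of segments whose inferred CN equals the ground truth CN."""
    correct = sum(1 for seg, cn in truth_cn.items() if inferred_cn.get(seg) == cn)
    return correct / len(truth_cn)


def total_cn_error(inferred_cn, truth_cn):
    """Sum of absolute segment CN differences between output and ground truth."""
    return sum(abs(inferred_cn.get(seg, 0) - cn) for seg, cn in truth_cn.items())


def sv_metrics(inferred_svs, truth_svs):
    """Return (precision, recall, F1-score) for sets of inferred and true SVs."""
    inferred, truth = set(inferred_svs), set(truth_svs)
    hits = len(inferred & truth)
    precision = hits / len(inferred) if inferred else 0.0
    recall = hits / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


toy_truth_cn = {"H1": 2, "H2": 3, "H3": 5}
toy_inferred_cn = {"H1": 2, "H2": 3, "H3": 4}
toy_precision, toy_recall, toy_f1 = sv_metrics({"a", "b", "c"}, {"a", "b", "d", "e"})
```

In the toy example, two of three inferred SVs are true and two of four true SVs are recovered, so precision is 2/3, recall is 1/2, and the F1-score is their harmonic mean, 4/7.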

(a-b) Results derived by Ambigram for COLO829 instances 1-2 with various sequencing protocols, depths, and purities. "-" means that the inputs of Ambigram are SVs called from one sequencing protocol (SV=PE, SV=10x, SV=PB, or SV=ONT) and ground truth CNs. "Resolved" means that the inferred BFB path includes all ground truth FBIs, otherwise "Unresolved". "Resolved by inferring the virtual FBI" signifies Ambigram resolves the BFB path by recovering the undetected FBIs in low sequencing depth and tumor purity scenarios. "Resolved by utilizing read linkage" means that Ambigram cannot resolve the BFB path with CNs and detected SVs, while it can resolve the path after incorporating the linked or long read linkage from 10x, PB, or ONT data. The CN accuracy is measured by the number of segments with correctly inferred CNs divided by the total segment number. The FBI precision measures the portion of the FBIs inferred correctly among predictions, that is, the number of ground truth FBIs inferred correctly by the tool over the total number of inferred FBIs. The FBI recall measures the portion of the FBIs inferred correctly among ground truths, that is, the number of ground truth FBIs inferred correctly by the tool over the total number of ground truth FBIs. The FBI F1-score is the harmonic mean of FBI precision and FBI recall. Note that an instance is resolved only if FBI recall = 1.
Supplementary Figure 18: The BFB event in a lung cancer sample HCC827 [2]. (a) The SV breakpoints split the local genome region of chr7 into five segments. The stairstep plot shows the ground truth CNs of segments and the CN derived by Ambigram and a combination of AA and AR. (b) List of SVs and the segments they connect. The head-to-head (hh) and tail-to-tail (tt) FBIs are colored light orange and light green, respectively. The reverse complementary segments have a red border.
The SV called from PE sequencing data is marked with a red circle, while SVs from OM data are marked with blue circles. (c) CIRCOS diagram of the complex BFB. The outermost track shows the local genome regions involved with the BFB event. The second outermost track illustrates the input region CNs, and the third track indicates the resolved CNs by Ambigram. Besides, the third track shows the resolved BFB paths, in which the circle and triangle points refer to the 5' end and 3' end, respectively. The innermost part represents all the SVs involved with the BFB event.
(d) Ambigram resolved the BFB path with all FBIs. In this BFB event, chr7 undergoes four BFB cycles. The first cycle occurs when a sister chromatid is replicated, and reverse segment H1 is fused with its complement H1; then, the second BFB cycle fuses segments H4 and H4; moreover, the third cycle starts when breakage occurs on reverse segment H2, and segment H2 is fused with H2; finally, the fourth BFB cycle fuses reverse segment H3 and its reverse complement, and another breakage on segment H3 results in a stable BFB path on chr7. The final BFB path is the same as the result of [2].
Besides, the third track shows the resolved BFB paths, in which the circle and triangle points refer to the 5' end and 3' end, respectively. The innermost part represents all the SVs involved with the BFB event. (e) Ambigram infers that the BFB event occurs in two stages. In the first stage, chr1 undergoes two BFB cycles. The first BFB cycle occurs when a sister chromatid is replicated, and segment H5 is fused with its reverse complement. Then the second BFB cycle starts when the double-strand breaks off at reverse segment H4. A duplication is reproduced, and reverse segment H4 is fused with segment H2. In the second stage, the HBV segments V3 and V4 are inserted into the area between reverse segment H4 and segment H2 through HBV integration, which indicates the end of this complex BFB event. (f) FuseSV resolved local genomic map. Note that (a-c) and (e) are modified from the original paper of FuseSV [6].
Figure 30: Complex BFB involving HBV integration and genes RCC2 and ARHGEF10L on chr1 of HCC 260T [5,6]. (a) The VIT breakpoints split the local genome region into H1, H2, and H3 segments. The light green denotes the read depth of each base pair, and the grey box shows the average CN of segments. The middle layer shows gene annotation. (b) The VIT and SV breakpoints split the virus genome into 11 segments (V1-V11). The top layer shows functional annotation. (c) List of VITs, SVs, and the segments they connected with. The head-to-head (hh) and tail-to-tail (tt) FBIs are colored light orange and light green, respectively. The reverse complementary segments have a red border. (d) CIRCOS diagram of the complex BFB. The outermost track shows the local genome regions involved with the BFB event. The second outermost track illustrates the input region CNs, and the third track indicates the resolved CNs by Ambigram. Besides, the third track shows the resolved BFB paths, in which the circle and triangle points refer to the 5' end and 3' end, respectively. The innermost part represents all the SVs involved with the BFB event. (e) Ambigram resolved the BFB path. First, segment H1 is connected to segment V3 through HBV integration. Then the HBV-integrated genome sequence undergoes three BFB cycles. The first cycle starts when the breakage happens at segment V7. A sister chromatid is reproduced, and segment V7 is fused with reverse segment V 10. Then the second BFB cycle occurs when a sister chromatid is replicated, and segment H1 is fused with its reverse complement. Finally, the third BFB cycle fuses segments V3 and V 3. In the second stage, reverse segment H3 replaces the head of the BFB path, which indicates the end of this complex BFB event. (f) FuseSV resolved local genomic map. In sample 260T, the HBV integrated upstream of the RCC2 gene on chr1 (triploid, minor CN=1).
The local genomic map on chr1 contains virus inversion V4+ to V7 connected HBV DNA segments, which induced the deletion of the H2 host segment on the minor allele. The viral enhancer I/II and core promoter in the local genomic map might promote the expression of oncogene RCC2. Note that (a-c) and (e) are modified from FuseSV's original paper [6].
Besides, the third track shows the resolved BFB paths, in which the circle and triangle points refer to the 5' end and 3' end, respectively. The innermost part represents all the SVs involved with the BFB event. (e) Ambigram resolved the BFB path. Segment V10 is linked to reverse segment H1 on chr10 through HBV integration, and then the HBV-integrated genome sequence encounters three BFB cycles. Firstly, a sister chromatid is replicated, and reverse segment H1 is fused with its reverse complement. Besides, the second BFB cycle occurs on segments V4 and V 7. Finally, the third BFB cycle fuses reverse segment H1 and its reverse complement. (f) FuseSV resolved local genomic map. The HBV integrated at the ANK3 gene on chr10 (tetraploid, minor CN=2). Short inverted HBV DNA segments V9-V10 were included in the chr10 local genomic map and substituted the H2 host segment (intron of ANK3) in two allele copies. Note that (a-c) and (e) are modified from FuseSV's original paper [
(e) Ambigram resolved the BFB path. This complex BFB event consists of two stages. In the first stage, segment V8 is integrated with reverse segment H1 on chr5, and the HBV-integration genome sequence undergoes three BFB cycles. The first two cycles occur when chromosomal duplication happens, and reverse segments H1 and V 1 are fused with their reverse complements, respectively. Then the third BFB cycle starts when the breakage occurs at segment V2, and a sister chromatid is reproduced, followed by the fusion between segment V2 and reverse segment V 4 on the chromatid duplication.
In the second stage, another HBV integration happens between reverse segment H3 and segment V5, replacing the head of the BFB path. (f) FuseSV detects the local genomic map of HBV integration at the heterozygous diploid chr5 of sample 261T, which consists of one copy of the normal allele (H1-H3) and one copy of the integration allele. Four short HBV segments (V5-V8) are inversely integrated upstream of oncogene TERT, with a one-copy deletion of human segment H2. The virus enhancer II and core promoter located in the inserted HBV segments may explain the up-regulated expression of TERT in sample 261T. Note that (a-c) and (e) are modified from FuseSV's original paper [6].

In addition, the third track shows the resolved BFB paths, in which the circle and triangle points mark the 5' end and 3' end, respectively. The innermost part represents all the SVs involved in the BFB event. (e) BFB path resolved by Ambigram. This complex BFB event consists of two stages. In the first stage, segment H5 on chr13 is integrated with reverse segment V1, and the HBV-integrated sequence undergoes five BFB cycles. The first two cycles occur when chromosomal duplication happens, and reverse segments V1 and H1 are fused with their reverse complements, respectively. The third BFB cycle starts when breakage occurs at reverse segment H3. A sister chromatid is replicated, followed by the fusion between reverse segment H3 and segment H3 on the duplicated chromatid. The last two cycles fuse segments H5 and H5 with their reverse complements, respectively, which marks the end of the first stage. In the second stage, another HBV integration happens between reverse segment H5 and segment V3, replacing the tail of the BFB path.
Gene annotation labels from the figure panel: ZNF432, ZNF528, ZNF600, ZNF665, ZNF331, CACNG8, LAIR1, KIR3DL3, TNNT1, ZNF841, ZNF578, ZNF320, VN1R2, DPRX, CACNG6, CDC42EP5, FCAR, ZNF616, ZNF808, ZNF321P, ZNF845, NLRP12, TSEN34, AC009892.10, ZNF836, ZNF701, ERVV-1, ZNF765, MYADM, LILRB5, LILRA1, NLRP2, PPP2R1A, ZNF83, ERVV-2, ZNF813, PRKCG, LILRB2, LILRB4, PPP1R12C, ZNF766, ZNF611, ZNF160, CTB-167G5.5, VSTM1, TTYH1, KIR3DL1, ZNF480, ZNF28, ZNF677, CACNG7, LILRA5, KIR2DL3, ZNF610, ZNF468, VN1R4, TARM1, LENG8.

In addition, the third track shows the resolved BFB paths, in which the circle and triangle points mark the 5' end and 3' end, respectively. The innermost part represents all the SVs involved in the BFB event. (e) BFB path resolved by Ambigram. In this complex BFB event, reverse segment H7 on chr19 is integrated with reverse segment V2, and the HBV-integrated local genomic map undergoes six BFB cycles. The first BFB cycle occurs when breakage happens at reverse segment H2, which is fused with segment H1 on the sister chromatid. The second BFB cycle starts when breakage happens on segment H6, which is fused with reverse segment H7 on the sister chromatid derived from chromosomal duplication. The third BFB cycle starts when another breakage occurs at reverse segment H4. A sister chromatid is replicated, followed by the fusion between reverse segment H4 and segment H3 on the duplicated chromatid. The fourth BFB cycle occurs when another breakage happens at segment H6, which is then fused with reverse segment H7 on the sister chromatid. The fifth cycle is similar to the first cycle, with reverse segment H2 fused with segment H1 on the sister chromatid. Finally, another breakage occurs at segment H2, and a sister chromatid is replicated. The final fusion between segments H2 and H1 leads to a stable state.
Gene annotation labels from the figure panel: FKBP4, CCND2, NTF3, GAPDH, CLEC4C, PHC1, GABARAPL1, ETV6, FOXM1, TIGAR, ANO2, PTMS, SLC2A3, A2M, CLEC7A, PRB3, RHNO1, RAD51AP1, VWF, EMG1, NECAP1, PZP, AC068775.1, BCL2L14, ITFG2, FGF23, TNFRSF1A, DPPA3, A2ML1, CLEC1B, PRB4, NRIP2, C12orf4, PLEKHG6, NANOGNB, KLRB1, TAS2R10, MANSC1, TEX52, AC005833.1, SCNN1A, SLC2A14, CLEC2D, TAS2R14, LRP6, TULP3, FGF6, CD9, RBP5, CLEC6A, CLECL1, TAS2R50, TEAD4, DYRK4, TAPBPL, GDF3, M6PR, CLEC9A, PRB1, TSPAN9, NDUFA9, MRPL51, NANOG, KLRG1, OLR1, PRB2, PRMT8, AKAP3, NCAPD2, FOXJ2, CLEC2B.

Supplementary Figure 42: Recurrent complex BFB involving chr7 of CHD sample SRR5114981. (a) The SV breakpoints split the local genome region into seven segments. The vertical lines above show the positions of SVs on chr7, and the middle layer shows the gene annotation. The black box shows the average CN of each segment. (b) List of SVs and the segments they connect. The head-to-tail (ht) deletion and tail-to-tail (tt) FBIs are colored light blue and light green, respectively. Reverse complementary segments have a red border. The family tree on the right shows the inheritance relationship between parents and the child, and the sample colored purple undergoes BFB events in the local region. However, the dataset contains no information about its parents. (c) CIRCOS diagram of the complex BFB. The outermost track shows the local genome regions involved in the BFB event. The second outermost track shows the input region CNs, and the third track shows the CNs resolved by Ambigram together with the resolved BFB paths, in which the circle and triangle points mark the 5' end and 3' end, respectively. The innermost part represents all the SVs involved in the BFB event. (d) BFB path resolved by Ambigram. This complex BFB event consists of two stages. In the first stage, the local region on chr7 undergoes three BFB cycles. The first BFB cycle starts when reverse segment H1 is fused with segment H1. Then another breakage occurs at segment H6. A sister chromatid is replicated, followed by the fusion between segment H6 and reverse segment H7 on the duplicated chromatid. Finally, another breakage occurs at reverse segment H3, and a sister chromatid is replicated. Reverse segment H3 and segment H2 are fused, and the final breakage on segment H1 marks the end of the first stage. In the second stage, another SV happens between segment H3 and segment H5, deleting segment H4 from the BFB path.
(a) Starting with a reference path, each BFB path is obtained by a sequence of BFB cycles, each consisting of fusion and breakage. In each BFB cycle, a local genomic map is fused with its sister chromatid by an FBI junction, and breakage happens at another FBI breakpoint, yielding a BFB path. (b) Two consecutive FBIs, s_a|s_a and s_a|s_a, happen on the same segment s_a. Although the FBI breakpoints differ, we consider them the same FBI junction (s, s_a) because they contribute the same CN increase to segment s_a. (c) A BFB path can be equivalently represented by a BFB tree. A child mono-chain is the right child vertex of its parent mono-chain, while a loop is the left child vertex of its parent entity. Through a preorder traversal, the entities in a BFB tree compose a BFB path. Given sequencing data, we can construct a BFB DAG by connecting all pairs of parent and child entities. Since sequencing data cannot determine the exact position of each copy of a loop, a loop with multiple copies is collapsed into one vertex. A BFB tree is built upon the parent-child relationship between entities, so a BFB tree is a sub-tree of a BFB DAG. We can extract multiple BFB trees from a BFB DAG to derive several BFB paths.
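The fusion-and-breakage step described in panel (a) can be illustrated with a minimal Python sketch. It is not Ambigram's implementation. It assumes a path is a list of signed segment indices (a negative sign marks the reverse complement), that fusion happens at the right end of the path, and that breakage is given as the number of segments to keep; the function names `reverse_complement`, `bfb_cycle`, and `copy_numbers` are hypothetical.

```python
from collections import Counter


def reverse_complement(path):
    """Reverse the segment order and flip the orientation of every segment."""
    return [-seg for seg in reversed(path)]


def bfb_cycle(path, keep):
    """Apply one idealized BFB cycle (assumption: fusion at the right end).

    The path is fused with its reverse complement, which models the
    fold-back junction with the sister chromatid, and the fused sequence
    is then broken so that only the first `keep` segments remain.
    """
    fused = path + reverse_complement(path)
    if not 1 <= keep <= len(fused):
        raise ValueError("keep must lie within the fused path")
    return fused[:keep]


def copy_numbers(path):
    """Count how many times each segment occurs, ignoring orientation."""
    return Counter(abs(seg) for seg in path)


reference = [1, 2, 3]                      # reference path: segments 1, 2, 3
after_one = bfb_cycle(reference, keep=5)   # fuse at segment 3, break after -2
after_two = bfb_cycle(after_one, keep=8)   # second cycle on the new path
print(after_two, dict(copy_numbers(after_two)))
```

Under these assumptions, repeated cycles amplify the segments closest to the fusion end more than the others, which is the stair-like copy-number pattern that an ideal BFB leaves behind.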

Topological order 2: l(1,6) -> m(1,6) -> m(1,4) -> l(2,6) -> l(2,4)
Figure 45: One example of perfect BFB. (a) There are six segments with a stair-like CN pattern and four FBIs. Ambigram first lists 21 mono-chains and 21 loops that are possible entities of a BFB path. Then Ambigram uses ILP to correct errors between the observed CNs of segments and the estimated CNs of entities, and finds five entities (two mono-chains and three loops) with CN larger than 0. Finally, Ambigram constructs a DAG consisting of the five entities. (b) Based on the DAG, Ambigram finds two topological orders that yield two different BFB paths. Ambigram keeps a temporary path, starting with the first entity, that the following entities extend. In each step, Ambigram tries to put a new entity into an appropriate position on the temporary path (labeled by dashed yellow rectangles). When all the entities are integrated into one path, the algorithm stops and outputs the resulting BFB path.

Supplementary Figure 46: Illustration of SV junction types. (a) An intra-chromosome SV junction can be a head-to-tail deletion (DEL-ht), tail-to-head duplication (DUP-th), head-to-head fold-back inversion (FBI-hh), or tail-to-tail fold-back inversion (FBI-tt). (b) An inter-chromosome SV junction can be a head-to-tail translocation (TRX-ht), tail-to-head translocation (TRX-th), head-to-head translocation (TRX-hh), or tail-to-tail translocation (TRX-tt).
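The naming scheme in this caption can be written as a small lookup, sketched here in Python. The function name `junction_type` and the convention of passing each breakpoint end as the letter "h" (head) or "t" (tail) are assumptions made for illustration. How a caller derives head or tail from read orientation is not specified in the caption and is left outside the sketch.

```python
# Labels for junctions whose two breakpoints lie on the same chromosome,
# keyed by the (end1, end2) pair, following the caption's naming.
INTRA_CHROMOSOME = {
    ("h", "t"): "DEL-ht",  # head-to-tail deletion
    ("t", "h"): "DUP-th",  # tail-to-head duplication
    ("h", "h"): "FBI-hh",  # head-to-head fold-back inversion
    ("t", "t"): "FBI-tt",  # tail-to-tail fold-back inversion
}


def junction_type(chr1, end1, chr2, end2):
    """Return the caption's label for a junction between two breakpoint ends."""
    ends = (end1, end2)
    if ends not in INTRA_CHROMOSOME:
        raise ValueError("each end must be 'h' (head) or 't' (tail)")
    if chr1 == chr2:
        return INTRA_CHROMOSOME[ends]
    # Breakpoints on different chromosomes are always translocations.
    return "TRX-" + end1 + end2


print(junction_type("chr7", "t", "chr7", "t"))
print(junction_type("chr1", "h", "chr10", "t"))
```

This sketch labels every same-chromosome hh or tt junction as an FBI, as the caption does; a real caller would presumably also check that the two breakpoints are close enough to count as a fold-back.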